Analysis of Gynecological Research Articles

INFO 526 - Summer 2025 - Final Project

A review of racial and ethnic disparities in research articles, as well as a comparison of study lengths between journals.
Author
Lucas Smith

Affiliation
School of Information, University of Arizona

Abstract

This project investigates gynecological journals over time, reviewing racial and ethnic inclusion as well as journal publication frequency. Initial findings suggest no change in racial and ethnic inclusion over time. Further investigation would be required to identify validity and

Pre-formatting Data

I will load data first to make it easier to see the visualizations in the write-up.

Intro

  • The aim of this project was to review research articles from gynecological journals to identify patterns and trends, notably across time and between journals. The data come from TidyTuesday and contain 318 observations with 65 features each, spanning 2010 to 2023. Features include basic article information such as the journal, article name, year of publication, and the racial makeup reported in the article. With a background as a software engineer in the healthcare industry, I was interested to see how this data would change over time. I posed two questions: how has racial and ethnic inclusion changed over time, and how does study length differ between journals?

Question 1

How has racial and ethnic inclusion changed over time?

I chose this question for a few reasons. Race and ethnicity have historically been challenging to balance in research. I also know from my background in medicine and previous learning in data science that skewed data can cause imbalance and a misunderstanding of underlying trends. It would be interesting to see exactly how the race and ethnicity reported in journals have changed over the years.

Answering this question required careful cleaning of the race data, because race and ethnicity labels were not standardized across journals, or even across articles within the same journal. I first pivoted the data longer so that I could aggregate it. I then considered using a regex-like function to map the reported values to their appropriate races, but found it too risky, for fear of misrepresenting a stratification. Instead, I mapped them manually.
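The pivot-and-map step described above could be sketched roughly as follows. This is only an illustration in plain Python, not the code used for the project: the column names (race1, race1_n, and so on), the sample rows, and the entries in RACE_MAP are all hypothetical.

```python
from collections import defaultdict

# Hypothetical wide-format rows: one per article, with free-text race labels
# and participant counts spread across numbered columns (assumed layout).
articles = [
    {"journal": "Journal A", "year": 2015,
     "race1": "White", "race1_n": 120,
     "race2": "African American", "race2_n": 40},
    {"journal": "Journal B", "year": 2015,
     "race1": "Caucasian", "race1_n": 80,
     "race2": "Not reported", "race2_n": 20},
]

# Hand-written lookup from reported labels to standardized categories.
# Every entry is an explicit, reviewable decision (no pattern matching).
RACE_MAP = {
    "white": "White",
    "caucasian": "White",
    "african american": "Black",
    "black": "Black",
    "not reported": "Unknown",
}


def pivot_longer(rows, max_groups=2):
    """Turn one row per article into one row per (article, race group)."""
    long_rows = []
    for row in rows:
        for i in range(1, max_groups + 1):
            label = row.get(f"race{i}")
            count = row.get(f"race{i}_n")
            if label is None or count is None:
                continue  # this article reported fewer groups
            long_rows.append({"journal": row["journal"], "year": row["year"],
                              "label": label, "n": count})
    return long_rows


def standardize(label):
    """Look up a reported label; fail loudly on anything not yet mapped."""
    key = label.strip().lower()
    if key not in RACE_MAP:
        raise KeyError(f"unmapped race label: {label!r}")
    return RACE_MAP[key]


def totals_by_year_and_race(long_rows):
    """Sum participant counts per (year, standardized race)."""
    totals = defaultdict(int)
    for r in long_rows:
        totals[(r["year"], standardize(r["label"]))] += r["n"]
    return dict(totals)


totals = totals_by_year_and_race(pivot_longer(articles))
```

Raising an error on an unmapped label, rather than guessing, keeps the manual review step in the loop, which is the same concern that ruled out a regex-based mapping.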

The first plot I decided to make is a line plot over time. I found that I needed to facet the plot by race to better see the trend within each race over time.
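A faceted line plot of this kind needs one series per race, with one point per year. The sketch below shows one way to compute such series as percentage shares. It assumes each year's percentages are taken over that year's total count, and the numbers in it are made up for illustration; they are not the project's data.

```python
from collections import defaultdict

# Made-up yearly totals keyed by (year, race); not the project's data.
totals = {
    (2019, "White"): 600, (2019, "Black"): 150, (2019, "Unknown"): 250,
    (2020, "White"): 100, (2020, "Unknown"): 400,
}


def yearly_shares(totals):
    """Return {race: [(year, percent), ...]}, one series per facet."""
    # Denominator: total count across all races for each year.
    year_sums = defaultdict(int)
    for (year, _race), n in totals.items():
        year_sums[year] += n

    # Build each race's series in chronological order.
    series = defaultdict(list)
    for (year, race), n in sorted(totals.items()):
        series[race].append((year, 100.0 * n / year_sums[year]))
    return dict(series)


shares = yearly_shares(totals)
```

Each key of shares corresponds to one facet, and each list holds the points for that facet's line. A race that is missing in a given year (here, Black in 2020) simply has no point for that year rather than a zero.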

Discussion

We find no significant change among the racial groups over time. There is a sharp change in 2020, when the percentage reported as unknown race/ethnicity rises to 98%. I believe this to be an anomaly due to COVID-19. The percentage for the white racial group stays roughly the same overall: it is high in 2010-2012 and then trends downward, but rises again towards 2021.

Question 2

How does the length of a study differ between journals?

This question could offer further insight into racial populations among different journals. I was interested in it first as a way to identify how the journals themselves differ. As I have never invested much time in writing research articles, it seems plausible to me that different journals would be used for different occasions. To answer this question, we only need the start and end date of each study, along with the journal in which the article was published.

The only transformation required to answer this question was to subtract the start year from the end year. This produces some studies with a length of 0 years, meaning that they started and ended in the same year. The first graph facets by journal. I decided to use contour plots to identify and differentiate the journals; the main purpose was not only to identify the central location for each journal, but also to try a graph type not commonly seen. The second graph is a box plot showing the median for each journal.
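The length calculation, and the per-journal median that the box plot highlights, can be sketched as below. The field names (start_year, end_year) and the sample rows are hypothetical placeholders rather than the dataset's actual columns or values.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: one per article, with study start and end years.
studies = [
    {"journal": "Journal A", "start_year": 2010, "end_year": 2012},
    {"journal": "Journal A", "start_year": 2015, "end_year": 2015},
    {"journal": "Journal A", "start_year": 2011, "end_year": 2016},
    {"journal": "Journal B", "start_year": 1990, "end_year": 2020},
    {"journal": "Journal B", "start_year": 2008, "end_year": 2018},
]


def study_length(row):
    """Length in years; 0 means the study started and ended in the same year."""
    return row["end_year"] - row["start_year"]


def median_length_by_journal(rows):
    """Group study lengths by journal and take the median of each group."""
    lengths = defaultdict(list)
    for row in rows:
        lengths[row["journal"]].append(study_length(row))
    return {journal: median(values) for journal, values in lengths.items()}


medians = median_length_by_journal(studies)
```

Because only years are subtracted, the result is a whole number of years and cannot distinguish a one-month study from an eleven-month study that falls within a single calendar year.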

Discussion

I found that there was a difference between journals. As expected, longer-standing journals do not have any data past around 2010, because they publish on a 10-year basis. It was interesting to see that the Human Reproduction journal very clearly publishes every 2 years. Gynecologic Oncology was the highest, with some articles listing study lengths of more than 30 years.